1 Administrivia

You should probably know that 

• the first problem set (due October 15) is posted on the class website, and 

• its hints are also posted there. 

Also, today in class there was a majority vote for posting problem sets earlier. Professor Kelner will post 
the problem sets from two years ago, but he reserves the right to add new problems once a problem set has 
already been posted. 

Questions from last time. 

• What is a level set? The level set of a function corresponding to a (fixed) constant c is the set of 
points in the function's domain whose image equals c. 

• What is a good reference on applications of expander graphs? A course taught by Nathan Linial and 
Avi Wigderson [3]. 

Plan for today. We use what we proved last time to obtain a local clustering algorithm from a random 
walk scheme. Then, noting that similar results to the ones proved last time also hold for PageRank, we 
obtain a second scheme that yields a second, better local clustering algorithm. Finally, we briefly motivate 
the technique of sparsification, which we will discuss next time. 

2 Local and Almost Linear-Time Clustering and Partitioning 

2.1 Review of Local Clustering 

Let us briefly review local clustering, which we introduced last time. Given a vertex v in some graph G, we
would like to know whether v is contained in a cluster, i.e., a subset of vertices that defines a cut with low conductance.
However, we want the running time of our algorithm to depend on the cluster size, and not on the size of
the graph. Last time we mentioned that a good example of a problem of this sort is trying to find a cluster
of web pages around mit.edu; we surely do not want the running time of this task to depend on the number
of sites created on the other side of the world. Let us make our goal a little bit more precise: in this lecture
we will describe an algorithm that, after running for time almost linear in K, outputs a cluster of size at
least K/2 around the starting vertex, if such a cluster exists.

2.2 General Strategy 

We observe that if we run a random walk starting from some vertex v contained in a cluster, then low- 
conductance cuts will be an obstacle to mixing; i.e., the random walk has trouble leaving the cluster. Hence, 
a good guess for the cluster is the set of vertices with the highest probability masses after a given number 
of steps (of a random walk that started at v). Last time we showed that this makes sense by proving the
Lovász-Simonovits theorem [4].

Therefore, a good primitive to construct an almost linear-time global algorithm is the following. Run a
random walk starting from v, and, at each step, for every vertex w, approximate the probability that the 
random walk is at w; then take the vertices with the k largest probability masses as a possible cut. Repeat 
this until you get a good cut or you reach a predetermined limit. 
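
As an illustration of this primitive, here is a minimal Python sketch. It computes the lazy random walk distribution exactly (the real algorithm only approximates it) and takes the k vertices with the largest probability mass as the candidate cluster. The toy graph (two 4-cliques joined by a single edge), the function names, and the choice of three steps are all assumptions made for the example.

```python
def lazy_walk_step(adj, p):
    """One lazy random walk step: stay put w.p. 1/2, else move to a uniform neighbor."""
    q = {v: 0.5 * mass for v, mass in p.items()}
    for v, mass in p.items():
        share = 0.5 * mass / len(adj[v])
        for w in adj[v]:
            q[w] = q.get(w, 0.0) + share
    return q


def walk_distribution(adj, start, steps):
    """Distribution of a lazy random walk after `steps` steps, started at `start`."""
    p = {start: 1.0}
    for _ in range(steps):
        p = lazy_walk_step(adj, p)
    return p


def top_k(p, k):
    """The k vertices carrying the largest probability mass."""
    return set(sorted(p, key=p.get, reverse=True)[:k])


# Hypothetical toy graph: cliques {0,1,2,3} and {4,5,6,7} joined by the edge (3, 4).
adj = {
    0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2, 4],
    4: [3, 5, 6, 7], 5: [4, 6, 7], 6: [4, 5, 7], 7: [4, 5, 6],
}
p = walk_distribution(adj, start=0, steps=3)
guess = top_k(p, 4)
```

On this toy graph the four highest-mass vertices after three steps are exactly the clique containing the start vertex, which is the behavior the strategy relies on.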



2.3 Obstacles 

We need a bound that says that our general strategy works, and that is why we proved the Lovász-Simonovits
theorem. However, the bound we have is global, i.e., it involves the conductance $\phi(G)$, and we do not have
the time to compute $\lambda_2$ for the whole graph to bound the conductance. Moreover, if we exactly compute all
the probabilities of the random walk, it will take too long. Finally, even if we approximate the probabilities,
we would need a stronger bound, and the quality of the approximation depends on the cluster size, which
we do not know in advance.



2.4 One Solution 

A reasonable solution goes as follows. We recall that the proof of the Lovász-Simonovits theorem that we
discussed last time used cuts on level sets of $p^t$. This implies that if a walk does not mix quickly,
we know that one of these cuts had low conductance. Therefore, we obtain the following corollary from the
Lovász-Simonovits theorem.

Corollary 1 Let G = (V, E) be a connected, undirected graph with m edges and let $\pi(x) = d(x)/2m$ be its stationary
distribution. For every subset of vertices $W \subseteq V$ and every time t, if $x = \sum_{w \in W} d(w)$ and $\phi(W)$
is the conductance of the cut $(W, \overline{W})$, then the following inequality holds:

$$\sum_{w \in W} \bigl(p^t(w) - \pi(w)\bigr) \le \min(\,\cdots)$$


Note that in the last lecture we stated a slightly weaker form of the theorem, where the conductance
$\phi(W)$ of the cut $(W, \overline{W})$ was replaced by the conductance $\phi(G)$ of the whole graph. Nevertheless, we did
actually prove the stronger version stated above.

The bound above has nothing to do with global properties of the graph. Therefore, we can use Corollary 1
for local clustering in the following way. If after $O(\log m / \phi^2)$ steps a set of vertices contains a constant factor
more probability than it would have under the stationary distribution, then we can get a cut C such that $\phi(C) \le \phi$.
(The cut can be obtained by mapping the probabilities to the real line and cutting as we did with $v_2$ a few
lectures ago.)
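
The cutting step can be sketched as a sweep over the vertices. The Python below is a minimal illustration under assumptions not fixed by the notes: vertices are ordered by p(v)/d(v) (the usual degree normalization), conductance is taken to be the number of cut edges divided by the smaller of the two volumes, and the graph and function names are hypothetical. It recomputes the conductance of every prefix from scratch, which is simple but slower than an incremental update.

```python
def conductance(adj, S):
    """Cut edges leaving S divided by the smaller of vol(S) and vol(complement)."""
    S = set(S)
    cut = sum(1 for v in S for w in adj[v] if w not in S)
    vol_S = sum(len(adj[v]) for v in S)
    vol_rest = sum(len(nbrs) for nbrs in adj.values()) - vol_S
    smaller = min(vol_S, vol_rest)
    return cut / smaller if smaller > 0 else float("inf")


def sweep_cut(adj, p):
    """Order the support of p by p(v)/d(v) and return the best prefix set."""
    order = sorted((v for v in p if p[v] > 0),
                   key=lambda v: p[v] / len(adj[v]), reverse=True)
    best_set, best_phi = None, float("inf")
    for k in range(1, len(order) + 1):
        phi = conductance(adj, order[:k])
        if phi < best_phi:
            best_set, best_phi = set(order[:k]), phi
    return best_set, best_phi


# Hypothetical toy graph: cliques {0,1,2,3} and {4,5,6,7} joined by the edge (3, 4).
adj = {
    0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2, 4],
    4: [3, 5, 6, 7], 5: [4, 6, 7], 6: [4, 5, 7], 7: [4, 5, 6],
}
# Lazy random walk distribution after two steps from vertex 0 on this graph.
p = {0: 47 / 144, 1: 31 / 144, 2: 31 / 144, 3: 32 / 144, 4: 3 / 144}
best_set, best_phi = sweep_cut(adj, p)
```

Here the sweep returns the clique {0, 1, 2, 3}, whose cut consists of the single bridge edge and therefore has conductance 1/13.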

A problem with this approach is that computing all the probabilities will be too slow. In particular, after
only a few steps we will have too many nonzero values to keep track of. Lovász and Simonovits proposed to
simply zero out the smaller probabilities and then prove that it does not hurt much to do so. However, the
analysis is really messy. Andersen, Chung, and Lang [1] propose an alternative approach that replaces
the probability vector of a lazy random walk with a slightly different vector called PageRank; we discuss
this approach in the following section.

(Note that for all of this to work we still need to prove a partial converse. Indeed, one can show that if
there exists a cut C of conductance $\phi^2$, then at least |C|/2 of its vertices, when used as starting vertices,
will give a cut of conductance $\phi$; otherwise the random walk would mix too quickly.)

3 PageRank



3.1 Definition 

Consider an undirected¹ connected graph G = (V, E). Recall that a simple random walk on G is a walk that,
starting at some initial vertex, at each step moves from the current vertex to a randomly chosen neighbor 
of the vertex; a lazy random walk on G is a walk that, starting at some initial vertex, at each step with 0.5 
probability stays on the current vertex and with 0.5 probability moves from the current vertex to a randomly 
chosen neighbor of the vertex. 

We now consider a new Markov process that is a modification of a lazy random walk on a graph. Fix
some distribution s over the vertices V of G and fix a parameter $\alpha \in (0, 1)$ (called the teleport probability).
Starting from some initial vertex, at each step of the process we do the following: with probability $1 - \alpha$ we
take a step of a lazy random walk on G, and with probability $\alpha$ we "teleport" to a vertex drawn from s. For
simplicity, we will take s to be a single vertex, i.e., all the probability mass is concentrated on one vertex.

The process converges to a stationary distribution (because it corresponds to an aperiodic, irreducible
Markov chain). For consistency with [1], we denote this stationary distribution (which depends on the
parameters s and $\alpha$) by $\mathrm{pr}_\alpha(s)$ and call it the PageRank vector; note that $\mathrm{pr}_\alpha(s)$ is a vector in $\mathbb{R}^n$, where
$n = |V|$. Moreover, it is easy to see that the stationary distribution $\mathrm{pr}_\alpha(s)$ is the unique solution to the
following equation:

$$\mathrm{pr}_\alpha(s) = \alpha s + (1 - \alpha) W \mathrm{pr}_\alpha(s) , \tag{1}$$

where W is the transition matrix corresponding to a lazy random walk on G.
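
To make Equation (1) concrete, the following Python sketch solves it by plain fixed-point iteration on a small graph. This is only an illustration, not the approximation scheme developed later in these notes; the toy graph, the function names, the value of the teleport probability, and the iteration count are assumptions chosen for the example.

```python
def lazy_walk_apply(adj, x):
    """Return W x, where W is the (column-stochastic) lazy random walk matrix."""
    y = {v: 0.5 * x[v] for v in adj}
    for v in adj:
        share = 0.5 * x[v] / len(adj[v])
        for w in adj[v]:
            y[w] += share
    return y


def pagerank(adj, s, alpha, iters=300):
    """Iterate p <- alpha*s + (1-alpha)*W p; the l1 error shrinks by (1-alpha) per round."""
    p = dict(s)
    for _ in range(iters):
        Wp = lazy_walk_apply(adj, p)
        p = {v: alpha * s[v] + (1 - alpha) * Wp[v] for v in adj}
    return p


# Hypothetical toy graph: cliques {0,1,2,3} and {4,5,6,7} joined by the edge (3, 4).
adj = {
    0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2, 4],
    4: [3, 5, 6, 7], 5: [4, 6, 7], 6: [4, 5, 7], 7: [4, 5, 6],
}
s = {v: 0.0 for v in adj}
s[0] = 1.0            # all teleport mass on vertex 0
alpha = 0.15          # assumed teleport probability
pr = pagerank(adj, s, alpha)

# Residual of Equation (1); it should be numerically zero.
Wpr = lazy_walk_apply(adj, pr)
residual = max(abs(pr[v] - (alpha * s[v] + (1 - alpha) * Wpr[v])) for v in adj)
```

Each round touches every edge of the graph, so this direct iteration is global; that cost is what a local method has to avoid.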

The point is that one can show that the Lovász-Simonovits theorem and its corollary hold for the
PageRank vector $\mathrm{pr}_\alpha(s)$, where s corresponds to the starting vertex and $\alpha$ plays the role of the number
of time steps (the process takes about $1/\alpha$ steps between teleports). Hence, rephrasing the discussion in
Section 2.4, we know that if a subset of vertices S contains
more than a constant factor more probability under $\mathrm{pr}_\alpha(s)$ than under the stationary distribution, then we
can find a cut with conductance $O\bigl(\sqrt{\alpha \log \sum_{v \in S} d(v)}\bigr)$. Moreover, approximating the PageRank vector $\mathrm{pr}_\alpha(s)$
is robust under small errors, because it is the solution of an equation rather than being the result of many
successive computations each with approximations.

Next, we prove some properties about the PageRank vector and then show how to approximate it. 

(Note that, just like before, we still need to prove a partial converse. Indeed, one can show that if there
exists a cut C of conductance $\alpha$, then at least |C|/2 of its vertices will give a cut of conductance $O(\sqrt{\alpha})$.)

3.2 Properties 

We now prove three properties about the stationary distribution $\mathrm{pr}_\alpha$.
Proposition 2 (Uniqueness) $\mathrm{pr}_\alpha(s)$ is unique.

Proof We must show that Equation (1) has a unique solution. Rewrite the equation as
$(I - (1 - \alpha)W)\,\mathrm{pr}_\alpha(s) = \alpha s$. The matrix $I - (1 - \alpha)W$ is strictly diagonally dominant² because the
absolute values of the off-diagonal elements in each column add up to $(1 - \alpha)/2$, while each diagonal element
is $1 - (1 - \alpha)/2 = (1 + \alpha)/2$, which is strictly larger. By the Gershgorin
circle theorem [2], it must be nonsingular, so that the equation has a unique solution. ■

Proposition 2 allows us to extend the definition of PageRank: given any vector $s \in \mathbb{R}^n$, not necessarily a
probability distribution over the vertices of the graph, we define $\mathrm{pr}_\alpha(s)$ as the unique solution of Equation (1).
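
One way to see what this definition computes is to expand the solution of Equation (1) as a series. The identity below is the standard Neumann series for the inverse of $I - (1 - \alpha)W$; it is offered as a supplementary sketch rather than as something derived in lecture. The series converges because W is column-stochastic, so the induced 1-norm of $(1 - \alpha)W$ is $1 - \alpha < 1$.

```latex
\mathrm{pr}_\alpha(s)
  \;=\; \alpha \bigl(I - (1-\alpha)W\bigr)^{-1} s
  \;=\; \alpha \sum_{t=0}^{\infty} (1-\alpha)^t \, W^t s .
```

In words, when s is a distribution the PageRank vector is a weighted average of the lazy-walk distributions $W^t s$, with weights $\alpha(1-\alpha)^t$ that sum to 1 and put the average number of steps at $(1-\alpha)/\alpha$.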

Proposition 3 (Linearity) $\mathrm{pr}_\alpha(cv + dw) = c \cdot \mathrm{pr}_\alpha(v) + d \cdot \mathrm{pr}_\alpha(w)$.

Proof By definition, the vector $x = \mathrm{pr}_\alpha(cv + dw)$ satisfies the following equation:

$$x = \alpha(cv + dw) + (1 - \alpha)Wx .$$

Let us verify that $x' = c\,\mathrm{pr}_\alpha(v) + d\,\mathrm{pr}_\alpha(w)$ satisfies the same equation:

$$\begin{aligned}
\alpha(cv + dw) + (1 - \alpha)Wx' &= \alpha(cv + dw) + (1 - \alpha)W\bigl(c\,\mathrm{pr}_\alpha(v) + d\,\mathrm{pr}_\alpha(w)\bigr) \\
&= c\bigl(\alpha v + (1 - \alpha)W\mathrm{pr}_\alpha(v)\bigr) + d\bigl(\alpha w + (1 - \alpha)W\mathrm{pr}_\alpha(w)\bigr) \\
&= c\,\mathrm{pr}_\alpha(v) + d\,\mathrm{pr}_\alpha(w) \\
&= x' .
\end{aligned}$$

By Proposition 2, the equation has a unique solution, so that x = x' and the result follows. ■

¹ Google uses the directed version, because hyperlinks "go only one way".
² A matrix is strictly diagonally dominant if $a_{ii} > \sum_{j \ne i} |a_{ji}|$ for all $i$.

Proposition 4 (Commutativity with W) $\mathrm{pr}_\alpha(Ws) = W\mathrm{pr}_\alpha(s)$.

Proof By definition, the vector $x = \mathrm{pr}_\alpha(Ws)$ satisfies the following equation:

$$x = \alpha Ws + (1 - \alpha)Wx .$$

Let us verify that $x' = W\mathrm{pr}_\alpha(s)$ satisfies the same equation:

$$\begin{aligned}
\alpha Ws + (1 - \alpha)Wx' &= \alpha Ws + (1 - \alpha)W^2\mathrm{pr}_\alpha(s) \\
&= W\bigl(\alpha s + (1 - \alpha)W\mathrm{pr}_\alpha(s)\bigr) \\
&= W\mathrm{pr}_\alpha(s) \\
&= x' .
\end{aligned}$$

By Proposition 2, the equation has a unique solution, so that x = x' and the result follows. ■

As a corollary of Propositions 2 and 4, we deduce that $\mathrm{pr}_\alpha(s)$ is the unique solution to

$$\mathrm{pr}_\alpha(s) = \alpha s + (1 - \alpha)\,\mathrm{pr}_\alpha(Ws) . \tag{2}$$

3.3 Approximating PageRank 

We would like to come up with a fast way to find an approximation to the unique solution $\mathrm{pr}_\alpha(s)$ of
Equation (1). We now describe an iterative procedure that does that.

We maintain two vectors, the approximation vector p and the error vector r, that satisfy the following
invariant:

$$p = \mathrm{pr}_\alpha(s - r) .$$

Starting with initial values p = 0 and r = s, in each iteration we pick a vertex u and update the two vectors
p and r to the new vectors p' and r' defined as follows:

$$p' = p + \alpha r(u)\chi_u ,$$

$$r' = r - r(u)\chi_u + (1 - \alpha) r(u) W\chi_u .$$

The vector $\chi_u$ is the characteristic vector of u, i.e., the vector with a 1 in the coordinate corresponding to
vertex u and 0 elsewhere. Given a fixed $\varepsilon > 0$, we keep iterating as long as there exists some vertex u such
that $r(u) \ge \varepsilon d(u)$.
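
The update rule translates almost line by line into code. The Python below is a minimal sketch under stated assumptions: s is the indicator of a single start vertex, vectors are stored sparsely as dictionaries, and candidate vertices are kept on a stack (the notes do not prescribe an order for picking u). The function name, the toy graph, and the parameter values are hypothetical.

```python
def approximate_pagerank(adj, start, alpha, eps):
    """Push procedure; returns sparse vectors (p, r), with s the indicator of `start`."""
    p, r = {}, {start: 1.0}
    pending = [start]                      # vertices that may have r(u) >= eps * d(u)
    while pending:
        u = pending.pop()
        ru, du = r.get(u, 0.0), len(adj[u])
        if ru < eps * du:
            continue                       # stale entry, nothing to push
        p[u] = p.get(u, 0.0) + alpha * ru  # p' = p + alpha r(u) chi_u
        r[u] = (1 - alpha) * ru / 2        # W chi_u has 1/2 at coordinate u ...
        share = (1 - alpha) * ru / (2 * du)
        for w in adj[u]:                   # ... and 1/(2 d(u)) at each neighbor of u
            r[w] = r.get(w, 0.0) + share
            if r[w] >= eps * len(adj[w]):
                pending.append(w)
        if r[u] >= eps * du:
            pending.append(u)
    return p, r


# Hypothetical toy graph: cliques {0,1,2,3} and {4,5,6,7} joined by the edge (3, 4).
adj = {
    0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2, 4],
    4: [3, 5, 6, 7], 5: [4, 6, 7], 6: [4, 5, 7], 7: [4, 5, 6],
}
alpha, eps = 0.15, 1e-4                    # assumed parameter values
p, r = approximate_pagerank(adj, 0, alpha, eps)
```

Only vertices that actually receive mass ever appear in p or r, so the work done depends on the region explored around the start vertex rather than on the size of the whole graph.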

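First, we prove that each iteration of the algorithm preserves the invariant $p = \mathrm{pr}_\alpha(s - r)$.
Proposition 5 $p' = \mathrm{pr}_\alpha(s - r')$.
Proof By Proposition 3, it suffices to show that $p' + \mathrm{pr}_\alpha(r') = p + \mathrm{pr}_\alpha(r)$. So let us verify that:

$$\begin{aligned}
p + \mathrm{pr}_\alpha(r) &= p + \mathrm{pr}_\alpha(r - r(u)\chi_u) + \mathrm{pr}_\alpha(r(u)\chi_u) \\
&= p + \mathrm{pr}_\alpha(r - r(u)\chi_u) + \alpha r(u)\chi_u + (1 - \alpha)\,\mathrm{pr}_\alpha(W r(u)\chi_u) \\
&= (p + \alpha r(u)\chi_u) + \mathrm{pr}_\alpha\bigl(r - r(u)\chi_u + (1 - \alpha) r(u) W\chi_u\bigr) \\
&= p' + \mathrm{pr}_\alpha(r') ,
\end{aligned}$$

where the second equality results from an application of Equation (2) to $\mathrm{pr}_\alpha(r(u)\chi_u)$, and the first and
third follow from linearity (Proposition 3). ■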

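Next, we prove a bound on the error vector.
Proposition 6 $\|r'\|_1 \le \|r\|_1 - \alpha r(u)$.
Proof Using the triangle inequality,

$$\|r'\|_1 = \|r - r(u)\chi_u + (1 - \alpha) r(u) W\chi_u\|_1 \le \|r - r(u)\chi_u\|_1 + (1 - \alpha) r(u) \|W\chi_u\|_1 .$$

However, $\|W\chi_u\|_1 \le 1$. Indeed, the ith element of $W\chi_u$ is $\frac{1}{2d(u)}$ when i is a neighbor of u, $\frac{1}{2}$ when i = u,
and 0 otherwise. Moreover, since r is nonnegative, $\|r - r(u)\chi_u\|_1 = \|r\|_1 - r(u)$. Therefore,

$$\|r'\|_1 \le \|r\|_1 - r(u) + (1 - \alpha) r(u) = \|r\|_1 - \alpha r(u) ,$$

as desired. ■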

Finally, we prove that the iterative procedure works. 

Theorem 7 Fix $\varepsilon > 0$. Suppose that in each iteration we pick a vertex u with the property that $r(u) \ge \varepsilon d(u)$.
Then the process terminates in $O\bigl(\frac{1}{\varepsilon\alpha}\bigr)$ iterations with vectors p and r that satisfy the following properties:

1. $\max_v \frac{r(v)}{d(v)} < \varepsilon$.

2. $\mathrm{vol}(\mathrm{supp}(p)) \le \frac{1}{\varepsilon\alpha}$, where supp(p) is the set of vertices for which p is nonzero and $\mathrm{vol}(S) = \sum_{x \in S} d(x)$.

Proof Initially, $\|r\|_1 = 1$. By Proposition 6, $\|r\|_1$ decreases at each iteration by at least $\alpha r(u)$, which by
assumption is at least $\alpha\varepsilon d(u)$. Therefore, since the degree of each vertex is at least 1, $\|r\|_1$ decreases at each
iteration by at least $\alpha\varepsilon$. We deduce that the algorithm must terminate in at most $O\bigl(\frac{1}{\varepsilon\alpha}\bigr)$ iterations.

Next, by definition, the process terminates when there are no more vertices u such that $r(u) \ge \varepsilon d(u)$.
Therefore, condition (1) is automatically satisfied.

Moreover, if we let T denote the number of iterations that the algorithm takes to terminate and let
$d_i$ denote the degree of the vertex picked in the ith step of the algorithm, then $\alpha\varepsilon \sum_{i=1}^{T} d_i \le 1$, so that
$\sum_{i=1}^{T} d_i \le \frac{1}{\varepsilon\alpha}$. Now note that every vertex in supp(p) must have been picked at least once during the
execution of the algorithm, so that

$$\mathrm{vol}(\mathrm{supp}(p)) \le \sum_{i=1}^{T} d_i \le \frac{1}{\varepsilon\alpha} ,$$

thus showing (2), and completing the proof of the theorem. ■

The theorem we just proved gives the approximation to the PageRank vector that we need, and we finally
get a local clustering algorithm. Note that to find a cut C we need $\varepsilon = O(1/\mathrm{vol}(C))$, so that the running
time of the process is proportional to $\mathrm{vol}(C)/\alpha$.
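
Putting the pieces together, a local clustering routine can be sketched as follows: approximate the PageRank vector by pushing, then sweep over its support in order of p(v)/d(v) and keep the prefix of smallest conductance. This Python sketch rests on assumptions the notes leave open: the conductance is taken as cut edges over the smaller volume, the parameter values are ad hoc rather than found by searching, and the graph and function names are hypothetical.

```python
def push_pagerank(adj, start, alpha, eps):
    """Approximate PageRank by pushing residual mass until r(u) < eps * d(u) everywhere."""
    p, r, pending = {}, {start: 1.0}, [start]
    while pending:
        u = pending.pop()
        ru, du = r.get(u, 0.0), len(adj[u])
        if ru < eps * du:
            continue
        p[u] = p.get(u, 0.0) + alpha * ru
        r[u] = (1 - alpha) * ru / 2
        for w in adj[u]:
            r[w] = r.get(w, 0.0) + (1 - alpha) * ru / (2 * du)
        pending.extend(w for w in adj[u] + [u] if r[w] >= eps * len(adj[w]))
    return p, r


def conductance(adj, S):
    """Cut edges leaving S divided by the smaller of vol(S) and vol(complement)."""
    cut = sum(1 for v in S for w in adj[v] if w not in S)
    vol_S = sum(len(adj[v]) for v in S)
    vol_rest = sum(len(nbrs) for nbrs in adj.values()) - vol_S
    smaller = min(vol_S, vol_rest)
    return cut / smaller if smaller > 0 else float("inf")


def local_cluster(adj, start, alpha, eps):
    """Sweep over supp(p) by p(v)/d(v); assumes eps is small enough that p is nonempty."""
    p, _ = push_pagerank(adj, start, alpha, eps)
    order = sorted(p, key=lambda v: p[v] / len(adj[v]), reverse=True)
    prefixes = [set(order[:k]) for k in range(1, len(order) + 1)]
    best = min(prefixes, key=lambda S: conductance(adj, S))
    return best, conductance(adj, best)


# Hypothetical toy graph: cliques {0,1,2,3} and {4,5,6,7} joined by the edge (3, 4).
adj = {
    0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2, 4],
    4: [3, 5, 6, 7], 5: [4, 6, 7], 6: [4, 5, 7], 7: [4, 5, 6],
}
cluster, phi = local_cluster(adj, start=0, alpha=0.1, eps=1e-3)
```

On this toy graph the routine recovers the clique around the start vertex, with conductance 1/13. In practice the right values of the two parameters are unknown and would have to be searched over.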

In order to obtain from this an almost-linear global partitioning algorithm, we do as follows. Let us
suppose that $1/\phi(G)$ is polylog(n). If we pick a random vertex v in a cluster of vertices C with conductance
$\phi^2$, we will find with probability at least 0.5 a set with volume at least vol(C)/2. However, this holds only
if we use "appropriate" parameters $\alpha$ and $\varepsilon$, which we do not know! The fix is to binary search over the
possibilities, incurring an additional cost that is only a logarithmic multiplicative factor. In conclusion, we
can find a globally optimal $\phi$ (up to the usual squaring error times some log factors) by cutting off chunks
of the graph and repeating. The total running time is almost linear because the running time on each chunk
is almost linear in its volume.

Caveat. In a random walk scheme, we need to take $1/\phi$ steps in order to get a cut of conductance
$\sqrt{\phi}$; hence, that takes time that is about (size of chunk) · poly($1/\phi$). Similarly, in a PageRank scheme,
we need to take $1/\alpha$ steps in order to get a cut of conductance $\sqrt{\alpha}$; again, that takes time that is about
(size of chunk) · poly($1/\phi$). As a consequence, the algorithm will run in time that is almost linear times some
poly($1/\phi$), which is almost linear only if $\phi$ is at least 1/polylog(n). Improving this for smaller conductances is
still an open problem.

4 Intro to Sparsification

Sparsification is a technique used in dynamic graph algorithms to reduce the dependence of an algorithm's 
time on the number of edges in a graph. We briefly motivate this technique now, and will discuss it next 
time. 

Suppose that we have a graph G = (V, E) with $m = \Theta(n^2)$ edges. We would like to solve some cut
problem (e.g., sparsest cut, min cut, s-t min cut). Most algorithms that solve these kinds of problems
have running times that grow with m, the number of edges in the graph. As a consequence, such
algorithms are much slower for dense graphs than for sparse graphs.

It would be really nice if we could somehow throw out a lot of edges from G and still get an approximate
answer, because the running time of the algorithm on the resulting graph would be close to that for a sparse
graph. More precisely, is there any way to "approximate" our graph G with a sparse graph G' that has the
property that all of its cuts have more or less the same size as the corresponding cuts of the original graph G?

To answer this question, next time we will introduce the idea of randomized sampling. It is not a spectral 
technique, but we will discuss spectral techniques that improve it. 